Frequencies of $k$-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared-Euclidean distance between $k$-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from $k$-mer frequencies is also studied. Finally, we report simulations showing the corrected distance outperforms many other $k$-mer methods, even when sequences are generated with an insertion and deletion process. These results have implications for multiple sequence alignment as well, since $k$-mer methods are usually the first step in constructing a guide tree for such algorithms.
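
As an illustration of the standard approach the abstract refers to, the following minimal Python sketch builds $k$-mer frequency vectors for two sequences and takes the squared-Euclidean distance between them. It shows only the uncorrected distance; the model-based correction derived in the paper is not reproduced here. The function names `kmer_frequencies` and `squared_euclidean`, the DNA alphabet, and the use of overlapping windows normalized to relative frequencies are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all |alphabet|**k k-mers, in lexicographic order.

    Hypothetical helper: counts overlapping windows of length k and divides
    by the number of windows. Windows containing symbols outside `alphabet`
    count toward the total but have no entry in the returned vector.
    """
    n_windows = len(seq) - k + 1
    if k < 1 or n_windows < 1:
        raise ValueError("need 1 <= k <= len(seq)")
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return [counts["".join(p)] / n_windows for p in product(alphabet, repeat=k)]


def squared_euclidean(u, v):
    """Squared-Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Usage: uncorrected pairwise distance between two short sequences.
x = kmer_frequencies("ACGTACGTAC", k=2)
y = kmer_frequencies("ACGTTCGTAA", k=2)
d = squared_euclidean(x, y)
```

Pairwise values of `d` computed this way are the kind of input that would be passed to a distance-based tree-building method; the abstract's point is that, without correction, such distances can lead to statistically inconsistent tree inference.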